Adversarial Attack

A comprehensive survey can be found here.

Terminology:

  • black-box/white-box attack: whether the adversarial example is generated without or with knowledge of the target model (architecture, parameters, gradients).

  • targeted/non-targeted attack: whether the attack aims to make the model predict a specific target label for the adversarial example, or merely any incorrect label.

  • universal perturbation: a single perturbation that fools a given model on any image with high probability.

Attack

  1. Backward Update

    • add an imperceptible perturbation to the input that increases the classification loss, typically by following the loss gradient with respect to the input (see the first sketch after this list)

    • universal adversarial perturbation: learn a single residual perturbation that fools the model on most clean images (see the second sketch after this list)

  2. Forward Update
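
A minimal PyTorch sketch of a gradient-based ("backward update") attack in the spirit of FGSM. The names `model`, `x` (an input batch scaled to [0, 1]), and `y` (integer labels) are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step gradient attack: perturb x along the sign of the input
    gradient so that the classification loss increases."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, then keep pixels valid.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

And a simplified sketch of learning a universal perturbation by gradient ascent on the average loss over a data loader (not the original DeepFool-based algorithm; `loader`, `step`, and `epochs` are illustrative assumptions):

```python
def universal_perturbation(model, loader, epsilon=0.05, step=0.01, epochs=5):
    """Learn one perturbation shared by all images that raises the loss
    on most clean inputs (simplified gradient-ascent variant)."""
    delta = None
    for _ in range(epochs):
        for x, y in loader:
            if delta is None:
                # Single perturbation, broadcast over the batch dimension.
                delta = torch.zeros_like(x[0], requires_grad=True)
            loss = F.cross_entropy(model(x + delta), y)
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += step * grad.sign()      # ascend the average loss
                delta.clamp_(-epsilon, epsilon)  # keep it imperceptible
    return delta.detach()
```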

Defense

  1. Use modified training samples during training (e.g., adversarial training, sketched after this list) or modified test samples during testing (e.g., input denoising or transformations)

  2. Modify the network: regularize model parameters or add a defensive layer/module

  3. Adversarial example detector: classify an example as adversarial or clean based on statistics of the input or of the network's intermediate activations
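
A minimal sketch of defense 1 via adversarial training, reusing the hypothetical `fgsm_attack` helper from the attack section (an `optimizer` over `model.parameters()` is assumed):

```python
def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    """One training step on adversarially perturbed samples (defense 1)."""
    x_adv = fgsm_attack(model, x, y, epsilon)  # craft perturbed training inputs
    optimizer.zero_grad()                      # discard gradients left by the attack
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, stronger multi-step attacks such as PGD are usually preferred over single-step FGSM for crafting the training perturbations.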

New perspective

Adversarial examples are not bugs, they are features (Ilyas et al., 2019): models rely on non-robust features that are genuinely predictive yet brittle, and adversarial perturbations exploit exactly those features.